
MVP. Storage policy support for Ozone #9807

Closed

greenwich wants to merge 11 commits into apache:master from greenwich:storage_policy

Conversation

@greenwich (Contributor) commented Feb 23, 2026

What changes were proposed in this pull request?

This PR adds storage tiering (MVP-1) to Apache Ozone, enabling bucket-level storage policies that direct new writes to specific storage media (SSD or DISK). It implements the full write path end-to-end across OM, SCM, and DN.


Apache Ozone currently has no mechanism for directing data placement based on storage media type. Although DataNodes already report per-volume storage types (SSD, DISK, ARCHIVE) to SCM via heartbeats, this information is never used for placement decisions. All writes land on whichever pipeline SCM happens to pick, regardless of the underlying storage hardware. This means operators with mixed-media clusters cannot separate hot (latency-sensitive) data from cold (throughput-oriented) data across different storage tiers.

This PR therefore introduces storage tiering: bucket-level storage policies that direct new writes to the correct storage media, implemented end-to-end across OM, SCM, and DN. A design document is included at hadoop-hdds/docs/content/design/storage-policy.md.

Policy Model

A new OzoneStoragePolicy enum maps semantic intent to physical StorageType:

Policy   Primary StorageType   Fallback
HOT      SSD                   DISK
WARM     DISK                  none
COLD     ARCHIVE               none

The default policy is WARM (DISK), matching current behavior. A StoragePolicyProto enum is added to OmClientProtocol.proto with STORAGE_POLICY_UNSET = 0 so that old data and old clients are unaffected: unset fields resolve to the server default.
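
For concreteness, here is a minimal sketch of the enum. The HOT/WARM/COLD constructor arguments match the diff quoted in the review thread below; the accessor names (getPrimary, getFallback) and the import path are illustrative, not necessarily the patch's exact API:

```java
import org.apache.hadoop.hdds.protocol.StorageType; // import path assumed; DISK, SSD, ARCHIVE, ...

// Policy-to-media mapping as described above; a sketch, not the exact patch code.
public enum OzoneStoragePolicy {
  HOT(StorageType.SSD, StorageType.DISK),   // latency-sensitive data, falls back to DISK
  WARM(StorageType.DISK, null),             // server default, matches current behavior
  COLD(StorageType.ARCHIVE, null);          // cold, throughput-oriented data

  private final StorageType primary;
  private final StorageType fallback;       // null when the policy defines no fallback

  OzoneStoragePolicy(StorageType primary, StorageType fallback) {
    this.primary = primary;
    this.fallback = fallback;
  }

  public StorageType getPrimary()  { return primary; }
  public StorageType getFallback() { return fallback; }
}
```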

How a Write Works with Storage Tiering

Client: ozone sh bucket create --storage-policy HOT o3://om/vol/bucket

On key write:

  1. OM resolves effective policy: bucket (HOT) → server default (WARM)
  2. OM maps HOT → StorageType.SSD
  3. OM passes StorageType=SSD to SCM's allocateBlock()
  4. SCM filters open pipelines: keeps only those where ALL member nodes
    have SSD volumes (using PipelineStorageTypeFilter)
    ├─ Found → allocate block on that pipeline
    └─ Not found → fall back to DISK, log warning
  5. DN receives CreateContainerRequest with storageType=SSD
  6. DN filters candidate volumes by type, creates container on SSD volume

Changes by Layer

Protobuf: StoragePolicyProto enum added. An optional storagePolicy field is added to BucketInfo (field 23) and BucketArgs (field 13), and an optional storageType to AllocateScmBlockRequestProto and CreateContainerRequestProto. All fields are optional for backward compatibility.

OM (bucket metadata): OmBucketInfo and OmBucketArgs carry a nullable OzoneStoragePolicy field. OMBucketCreateRequest persists the policy on bucket creation. OMBucketSetPropertyRequest handles policy updates. OzoneManager.getDefaultStoragePolicy() provides the server-side default (configurable via ozone.default.storage.policy).
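
A one-method sketch of the server-side default, assuming the ozone.default.storage.policy key from the Configuration section below (the actual method body may differ):

```java
// Hypothetical body for OzoneManager.getDefaultStoragePolicy(): read the
// cluster-wide default policy name from configuration, defaulting to WARM.
public OzoneStoragePolicy getDefaultStoragePolicy() {
  String name = configuration.get("ozone.default.storage.policy", "WARM");
  return OzoneStoragePolicy.valueOf(name.trim().toUpperCase());
}
```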

OM (write-time resolution): OMKeyRequest.resolveEffectiveStoragePolicy() resolves the effective policy at write time using the chain: bucket policy → server default. The resolved StorageType is passed to allocateBlock(). This method is called from OMKeyCreateRequest, OMFileCreateRequest, and OMAllocateBlockRequest.
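
A sketch of that chain (signature and accessor names are illustrative):

```java
// Bucket policy wins; otherwise the server default applies. The caller then
// passes the resolved policy's primary StorageType to allocateBlock().
OzoneStoragePolicy resolveEffectiveStoragePolicy(OmBucketInfo bucketInfo) {
  OzoneStoragePolicy bucketPolicy = bucketInfo.getStoragePolicy(); // null if unset
  return bucketPolicy != null ? bucketPolicy : ozoneManager.getDefaultStoragePolicy();
}
```

For a HOT bucket this resolves to StorageType.SSD; for a bucket with no policy it resolves to the WARM default, i.e. StorageType.DISK.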

SCM (pipeline filtering): A new PipelineStorageTypeFilter utility filters pipelines using a set-based approach: it builds a Set<UUID> of all healthy nodes that have the requested StorageType, then filters pipelines by checking whether all member nodes are in that set. At scale (2000 pipelines, 200 nodes), this takes ~0.5ms per allocation vs ~3-5ms for a naive per-pipeline approach. Both WritableECContainerProvider and WritableRatisContainerProvider apply this filter.
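
A sketch of the set-based approach (collection shapes and accessor names are assumptions; the patch's actual PipelineStorageTypeFilter may differ):

```java
import java.util.*;
import java.util.stream.Collectors;

final class PipelineStorageTypeFilter {
  static List<Pipeline> filter(Collection<Pipeline> pipelines,
      Map<UUID, Set<StorageType>> healthyNodeTypes, StorageType wanted) {
    // One pass over nodes: collect every healthy node that has the wanted type.
    Set<UUID> qualified = healthyNodeTypes.entrySet().stream()
        .filter(e -> e.getValue().contains(wanted))
        .map(Map.Entry::getKey)
        .collect(Collectors.toSet());
    // One pass over pipelines: keep a pipeline only if ALL members qualify.
    return pipelines.stream()
        .filter(p -> p.getNodes().stream()
            .allMatch(dn -> qualified.contains(dn.getUuid())))
        .collect(Collectors.toList());
  }
}
```

Building the qualified-node set once makes each allocation roughly O(nodes + pipeline members) instead of re-checking node reports per pipeline, which is where the ~0.5ms vs ~3-5ms difference comes from.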

SCM (proactive pipeline creation): On a 32-node cluster (16 SSD-only, 16 DISK-only) with EC 3+2, the probability that a randomly formed 5-node pipeline is all-SSD is only ~2.2%. Without proactive creation, HOT writes would almost always fall back to DISK. When ozone.scm.pipeline.creation.storage-type-aware.enabled=true, BackgroundPipelineCreator iterates over StorageType values and creates per-type pipelines using SCMCommonPlacementPolicy to select only nodes with the matching storage type. On clusters where every DN has both SSD and DISK, this config is unnecessary, since all nodes qualify for both types.
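
The ~2.2% figure is the straightforward hypergeometric count of all-SSD 5-node subsets:

$$P(\text{all-SSD pipeline}) = \frac{\binom{16}{5}}{\binom{32}{5}} = \frac{4368}{201376} \approx 2.2\%$$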

SCM (fallback): BlockManagerImpl.allocateBlock() wraps the container allocation in a try-catch. If no pipeline matches the primary StorageType and the policy defines a fallback (HOT: SSD → DISK), it retries with the fallback type and emits a WARN log for monitoring. If no fallback is defined (WARM, COLD) or the fallback also fails, the allocation fails as it does today.
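
A sketch of that wrapper, with hypothetical helper and exception names (BlockManagerImpl's actual control flow may differ):

```java
// Try the primary media first; on failure, retry once with the policy's
// fallback type if one is defined, otherwise rethrow (today's behavior).
try {
  return allocateBlockOnType(size, replicationConfig, policy.getPrimary());
} catch (IOException noMatchingPipeline) {
  StorageType fallback = policy.getFallback();      // e.g. HOT: SSD -> DISK
  if (fallback == null) {                           // WARM, COLD: no fallback defined
    throw noMatchingPipeline;
  }
  LOG.warn("No pipeline with StorageType {}, falling back to {}",
      policy.getPrimary(), fallback);
  return allocateBlockOnType(size, replicationConfig, fallback);
}
```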

DN (volume selection): KeyValueContainer.create() filters the candidate HddsVolume list by the requested StorageType before passing it to VolumeChoosingPolicy. The VolumeChoosingPolicy interface itself is unchanged; filtering happens upstream.
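
A sketch of the upstream filtering (accessor names assumed; chooseVolume keeps its existing contract):

```java
// Narrow the candidate volume list to the requested media type before the
// unchanged VolumeChoosingPolicy picks one of them.
List<HddsVolume> candidates = volumeSet.getVolumesList().stream()
    .filter(v -> v.getStorageType() == requestedStorageType)
    .collect(Collectors.toList());
HddsVolume chosen = volumeChoosingPolicy.chooseVolume(candidates, maxContainerSize);
```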

CLI: ozone sh bucket create --storage-policy HOT|WARM and ozone sh bucket update --storage-policy HOT|WARM are added. ozone sh bucket info automatically displays the policy via JSON serialization (no code change needed).

Scope and Limitations

This PR is scoped to OBJECT_STORE buckets with EC replication. FSO and Ratis buckets are not affected — they continue using default placement. Future work (prefix-level policies, a Mover
tool for migrating existing data, on-demand pipeline creation, S3 x-amz-storage-class integration) is described in the design document.

Configuration

Key                                                       Default   Description
ozone.scm.pipeline.creation.storage-type-aware.enabled   false     Enable proactive per-StorageType pipeline creation
ozone.default.storage.policy                              WARM      Cluster-wide default storage policy

Backward Compatibility

All protobuf fields are optional with UNSET = 0 defaults. Old clients ignore new fields. Existing data is unaffected — keys without a policy resolve to WARM (DISK), matching current
behavior. No DB migration is required.

What is the link to the Apache JIRA

Please create an issue in ASF JIRA before opening a pull request, and you need to set the title of the pull
request which starts with the corresponding JIRA issue number. (e.g. HDDS-XXXX. Fix a typo in YYY.)

(Please replace this section with the link to the Apache JIRA)

How was this patch tested?

Unit tests, integration testing, and system testing using the company environment.

@spacemonkd (Contributor)

Thanks for the patch @greenwich. If this is something you are working on, it would be great to have a bit more info on the context, use case, and goals of this PR. Also, if you have any reference JIRA for this with the relevant info, that would be great.


HOT(StorageType.SSD, StorageType.DISK),
WARM(StorageType.DISK, null),
COLD(StorageType.ARCHIVE, null);

What is StorageType.ARCHIVE in this context? If DISK = HDD, what do we use for a slower storage type?

@greenwich (Contributor, Author) Feb 23, 2026

Yeah, good that you pointed it out; it's not needed here. I guess ARCHIVE comes from the ancient HDFS code. In our team we use the following storage types: DISK, SSD, NVME.

From my perspective, it should be:

  • HOT -> NVME
  • WARM -> SSD
  • COLD -> DISK

Technically, NVMe drives are SSDs, but they are much faster, with different throughput and performance profiles, and we want a separate tier for each. So, within our team, we would need to define separate storage types for them.

I didn't want to change the policies at this point, but we should. What are your thoughts?

Also, as a user, I would appreciate the ability to define and configure my own storage policies and storage types. We missed that in HDFS, but it could be useful here, because we use multiple SSD types with different sizes, performance, etc.; I would map them to individual storage types with specific storage policies.

*/
public enum OzoneStoragePolicy {

HOT(StorageType.SSD, StorageType.DISK),

What would we call an end-to-end NVMe solution?

@greenwich (Contributor, Author) Feb 23, 2026

Those things definitely need refinement; I responded to your comment above. Please note it's a draft PR.

@errose28 (Contributor)

Hi @greenwich, I'm not sure all the design/requirements for this feature have been completed to the point where we are ready to add code. Right now it looks like we should continue discussion in #6989 or open a new PR. I have pinged the contributors on that change for the best way forward.

@greenwich greenwich marked this pull request as draft February 23, 2026 21:33
@greenwich (Contributor, Author) commented Feb 23, 2026

Thanks, everyone, for having a look! I am very sorry, but this PR wasn't intended to be public or in the Open state. My bad; I'm moving it to Draft.

I explained my motivation and urgency here: #6989 (comment) cc @errose28

@greenwich greenwich changed the title from "Storage policy support for Ozone" to "MVP. Storage policy support for Ozone" on Feb 24, 2026
@greenwich greenwich force-pushed the storage_policy branch 3 times, most recently from 67504c0 to 4004e4a on February 28, 2026 06:39
@greenwich greenwich force-pushed the storage_policy branch 2 times, most recently from 7d9cad1 to 5bc9f04 on March 4, 2026 06:31
@greenwich greenwich force-pushed the storage_policy branch 2 times, most recently from df684d6 to c0f9eb5 on March 16, 2026 04:28
@greenwich greenwich force-pushed the storage_policy branch 3 times, most recently from 0a7151c to 9143dea on March 20, 2026 00:34
@greenwich greenwich force-pushed the storage_policy branch 2 times, most recently from 42d1dea to 79ba94f on March 20, 2026 04:04
@github-actions

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions Bot added the stale label Apr 21, 2026
@github-actions

Thank you for your contribution. This PR is being closed due to inactivity. Please contact a maintainer if you would like to reopen it.

@github-actions github-actions Bot closed this Apr 28, 2026
